11 Randomness and Complexity
123
incompressible; that is, it cannot be recreated by any means shorter than the process
actually used to generate it in the first place.
We have previously stated that bioinformatics could be considered to be the study
of the departures from randomness of DNA. We are shown a sequence of DNA:
Is it random? We want to be able to quantify its departure from randomness. Pre-
sumably those sequences belonging to viable organisms, or even to their individual
proteins or promoter sequences, are not random. What about introns, and intergenome
sequences? If they are indeed “junk”, as is sometimes (facetiously?) asserted, then
we might well expect them to be random. Even if they started their existence as
nonrandom sequences, they may have been randomized since they would be subject
to virtually no selection pressure. Mutations are supposed to be random and occur at
random places. The opposite procedure would be that all DNA sequences started as
random ones and then natural selection eliminated many according to some system-
atic criterion; therefore, the extant collection of the DNA of viable organisms on this
planet is not random. Can we, then, say anything about the randomness or otherwise
of an individual sequence taken in isolation?
Similar considerations apply to proteins. Given a collection of amino acid
sequences of proteins (which, to be meaningful, should come from the same genome),
we can assess the likelihood that they arose by chance and the degree of their depar-
tures from randomness.
All such sequences can be idealized as sequences of Bernoulli trials (see
Sect. 9.2.3), which are themselves abstractions of a coin tossing experiment. Since
order does not matter in determining the probability of a given overall outcome, 50
heads followed by 50 tails has the same probability of occurring as 50 alternations
of heads and tails, which again is no less probable than a particular realization in
which the heads and tails are “randomly” mixed.
Any nonbinary sequence can, of course, be encoded in binary form. Typical
procedures for biological sequences (amino acids or nucleotides) are to consider
nucleotides as purines (0) or pyrimidines (1), or amino acids as hydrophobic (apolar)
or hydrophilic (polar) residues (cf. Markov’s encoding of poetry as a sequence of
vowels and consonants). Alternatively, the nucleotides could constitute a sequence
in base 4 (A identical to≡0, C identical to≡1, T identical to≡2, G identical to≡3), which can then be converted to base 2.
It is a commonly held belief that after a long sequence of heads (say), the opposite
result (tails) becomes more probable. There is no empirical support for this assertion
in the case of coin tossing. In other situations in which the outcome depends on
selecting elements from a finite reservoir, however, clearly this result must hold.
Thus, if a piece of DNA is being assembled from a soup of base monomers at
initially equal concentrations, if by chance the sequence starts out by being poor in
A, say, then later on this must be compensated by enrichment (chain elongation ends
when all available nucleotides have been consumed).
Formal Notions of Randomness In order to proceed further, we need to more
carefully understand what we mean by randomness. Despite the fact that the man in
the street supposes that he has a good idea of what it means, randomness is a rather
delicate concept. The toss of an unbiased coin is said to be random; the probability